library(tidyverse)
library(BiocManager)
library(here)
library(dbscan)
library(rgl)
library(car)
library(BBmisc)
library(kableExtra)
library(tableone)
heart = read_csv(here::here("Data","heart.csv"))
Clustering is an unsupervised, exploratory data mining technique. Effectively this means two things:
That being said, clustering is particularly useful when we suspect that a latent (hidden) variable exists in our data and we wish to get a better understanding on how it relates back to the original data.
There are many ways to go about clustering in terms of workflow; here I’ll outline the simplest approach.
The choice of variables
First we must decide on what variables we are going to use to cluster.
Typically this choice is made on a low dimension (2 or 3) of variables that make sense for your data. For example:
The analysis that follows
Once we have the clusters, we can test the proportions of other key variables such as:
In general it is not easy to define a cluster, as a result, many types of clustering have been developed.
Centroid based clustering
In centroid-based clustering we first choose the number of clusters, then optimize the assignment of points so that the total squared distance from each point to the center of its assigned cluster is minimized.
Common algorithms include: k-means, k-medoids, and k-medians.
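As a quick illustration, centroid-based clustering can be run with base R’s kmeans(). The synthetic two-blob data below is my own toy example, not part of the heart data analysis:

```r
# Two well-separated synthetic 2-D clusters
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# k-means with 2 centers; nstart restarts help avoid bad local optima
km <- kmeans(x, centers = 2, nstart = 25)

table(km$cluster)  # cluster sizes
km$centers         # the two optimized centroids
```

Note that the number of clusters (here 2) must be supplied up front, which is exactly the limitation DBSCAN avoids.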
Hierarchical clustering
In hierarchical clustering we aim to form a hierarchy by using both a distance measure and a linkage criterion.
The end result is a tree of clusters that can be visualized with a dendrogram.
Common algorithms include: single-link, complete-link, and Ward.
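A minimal sketch of the hierarchical workflow using base R’s dist(), hclust() with Ward linkage, and cutree(), again on synthetic data of my own choosing:

```r
# Two synthetic 2-D clusters
set.seed(1)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 5), ncol = 2))

d  <- dist(x)                        # Euclidean distance matrix
hc <- hclust(d, method = "ward.D2")  # Ward linkage

plot(hc)                     # dendrogram of the full hierarchy
labels <- cutree(hc, k = 2)  # cut the tree into two clusters
table(labels)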
Distribution-based clustering
With distribution-based clustering we use distribution models as the basis for clusters. That is, we form a cluster from points that fit a known distribution well.
A common approach is the expectation-maximization (EM) algorithm where the assumption of Gaussian distributions is used.
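To make the EM idea concrete, here is a minimal hand-rolled sketch of EM for a two-component 1-D Gaussian mixture in base R. This is an illustrative toy with arbitrary starting values, not a production implementation:

```r
# Data drawn from two Gaussians, means 0 and 6
set.seed(7)
x <- c(rnorm(200, mean = 0, sd = 1), rnorm(200, mean = 6, sd = 1))

# Crude initialization
mu <- c(min(x), max(x)); sigma <- c(1, 1); pi_k <- c(0.5, 0.5)

for (iter in 1:50) {
  # E-step: responsibility of component 1 for each point
  d1 <- pi_k[1] * dnorm(x, mu[1], sigma[1])
  d2 <- pi_k[2] * dnorm(x, mu[2], sigma[2])
  r  <- d1 / (d1 + d2)
  # M-step: update weights, means, and standard deviations
  pi_k  <- c(mean(r), 1 - mean(r))
  mu    <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
  sigma <- c(sqrt(sum(r * (x - mu[1])^2) / sum(r)),
             sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r)))
}

cluster <- ifelse(r > 0.5, 1, 2)  # hard assignment from responsibilities
round(mu, 1)                      # recovered means, near 0 and 6
```

In practice one would use a dedicated package rather than hand-rolling EM, but the two alternating steps above are the whole idea.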
Density-based clustering
In density-based clustering we don’t need to supply the number of clusters; instead, the algorithm relies on D-dimensional spheres to iteratively construct them.
Density-based spatial clustering of applications with noise (DBSCAN) is the most commonly used algorithm.
This report aims to verify the accuracy of DBSCAN, particularly when applied to clinical data.
Before we get into the details of DBSCAN, we must first discuss the terminology used by density-based clustering in general.
First, assume we take \(D\) variables to cluster, that is, we are clustering in a \(D\) dimensional space.
Given this, density-based clustering depends on two parameters:
Each point is then classified into one of four categories:
Figure 1: Classification of density-based points
We are now at the point where we can talk about how the DBSCAN algorithm works. If we look at it from an abstract point of view, we can break the algorithm into three main steps:
A few things to point out:
A quick example can be seen in a web app made by Naftali Harris.
You can test it out for yourself here! https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
Animation 1: DBSCAN on a smiley face with \(\epsilon = 1\) and \(minPts = 4\)
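The same idea can be reproduced locally with the dbscan package on a synthetic “blobs plus noise” dataset. The blob locations and the eps/minPts values below are my own illustrative choices:

```r
library(dbscan)

# Two dense blobs plus a sprinkling of uniform noise
set.seed(3)
blob1 <- matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2)
blob2 <- matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2)
noise <- matrix(runif(20, min = -2, max = 5), ncol = 2)
x <- rbind(blob1, blob2, noise)

scan <- dbscan(x, eps = 0.4, minPts = 4)
table(scan$cluster)  # label 0 = noise, 1..k = clusters
```

Points in sparse regions receive label 0 (noise), while each dense region becomes its own cluster, without the number of clusters ever being specified.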
The dataset I’ve chosen to apply DBSCAN on is a clinical dataset that contains medical information regarding cardiovascular disease across 14 attributes over 303 subjects.
Data was collected at the Cleveland Clinic Foundation.
This data set in particular has been used to test and verify developments in machine learning algorithms and is currently stored on the UCI Machine Learning Repository.
It turns out that this data set has the information we need to test out DBSCAN:
The above six variables will be the ones we focus on.
Picking variables
First, we will start by taking two variables at a time. This makes visualization and interpretation easier.
That being said, we start off by picking a pair of variables from age, resting blood pressure, and serum cholesterol levels.
We then scale each variable. This ensures that equal weight is being applied to both attributes so that the variables with higher ranges (cholesterol or blood pressure) don’t dominate ones with lower ranges (age).
Obtain the best parameters
Next we will find the best values for \(\epsilon\) and minPts.
We set minPts to be 3 (one plus the dimension). This is the common approach, and although it implies that we may end up with small clusters (of size 3), optimizing \(\epsilon\) should prevent too many from being created.
To optimize \(\epsilon\) we produce a 3-Nearest Neighbor Distance Plot. For each point we compute the distance to its 3rd nearest neighbor, called the 3-nearest neighbor distance (3-NN distance). The points, sorted by this distance, are arranged on the x-axis, and the 3-NN distances are placed on the y-axis.
The optimized value for \(\epsilon\) is approximated by the y value at the middle of the bend (if one exists) in the 3-NN distance plot. The idea is that this value should separate distances that occur frequently from those that occur rarely. The end result will capture distinct clusters, even if they appear close enough to be considered just one cluster.
Run DBSCAN
With these parameters we compute the cluster labels using DBSCAN and then project the labels back on to the (unscaled) data corresponding to the two variables we picked.
Test the clusters
When more than one cluster exists, we can test the proportions of the aforementioned variables.
First, we stratify by cluster group. Then we perform a chi-square test for independence. In the case that one of our clusters is small, we can instead perform a Fisher’s exact test.
If in the above we find something worth investigating (typically a p-value less than .2), we can go back to the original three variables (age, cholesterol, and blood pressure) and stratify by the variable we wish to investigate further.
The idea is that if the second test is significant for one of the variables, then we have discovered a pre-existing relationship in our data using DBSCAN.
First we produce the 3-NN distance plot.
# Select the variables
heart_cluster = heart %>% dplyr::select(chol, age)
# Scale the variables
heart_cluster = heart_cluster %>% scale()
# Plot the 3-NN distances
kNNdistplot(heart_cluster, 3)
abline(h = .65, lty = 2)
Based on this plot, we pick \(\epsilon\) to be .65. minPts, of course, is set to 3, as previously mentioned.
Now, we can perform DBSCAN and plot the clusters.
# Perform DBSCAN
scan = dbscan(heart_cluster, eps = .65, minPts = 3)
# Extract the cluster labels
clusters = scan$cluster
# Merge labels back to the original data
data = data.frame(heart, clusters)
# Plot
data %>% ggplot(aes(y=chol, x=age, color=as.factor(clusters))) + geom_point()+
scale_colour_manual(values=c("#D55E00","#0072B2","#E69F00"),
labels=c("Noise","Cluster 1", "Cluster 2"),
name = "Clusters") +
xlab("Age (Years)") + ylab("Serum Cholesterol (mg/dl)")
We see that DBSCAN has produced one large cluster, one smaller cluster, and some noise.
Let’s investigate a bit further.
# Filter out noise points
data_filt = data %>% filter(clusters!=0)
# Make categorical variables into factors.
data_filt = data_filt %>% mutate(cp=as.factor(cp),
target=as.factor(target),
sex=as.factor(sex),
clusters=as.factor(clusters))
# Give better names
data_filt = data_filt %>% mutate(sex=ifelse(sex==0,"Female","Male"),
target=ifelse(target==0,"No Heart Disease","Heart Disease"),
clusters=ifelse(clusters==1,"Cluster 1","Cluster 2")) %>%
rename(`Chest Pain Level` = cp, Sex=sex, Target=target)
# Create a summary table stratified on the clusters.
table_descriptives <-
CreateTableOne(data = data_filt, vars = c("Sex","Chest Pain Level","Target"), strata = "clusters", test=TRUE, testExact = "fisher.test")
kable(print(table_descriptives, missing = TRUE, showAllLevels = TRUE, print = FALSE,exact = c("Sex","Chest Pain Level","Target"))) %>% kable_styling(font_size = 9)
| level | Cluster 1 | Cluster 2 | p | test | Missing | |
|---|---|---|---|---|---|---|
| n | 293 | 3 | ||||
| Sex (%) | Female | 90 (30.7) | 3 (100.0) | 0.030 | exact | 0.0 |
| Male | 203 (69.3) | 0 ( 0.0) | ||||
| Chest Pain Level (%) | 0 | 137 (46.8) | 2 ( 66.7) | 1.000 | exact | 0.0 |
| 1 | 50 (17.1) | 0 ( 0.0) | ||||
| 2 | 83 (28.3) | 1 ( 33.3) | ||||
| 3 | 23 ( 7.8) | 0 ( 0.0) | ||||
| Target (%) | Heart Disease | 160 (54.6) | 2 ( 66.7) | 1.000 | exact | 0.0 |
| No Heart Disease | 133 (45.4) | 1 ( 33.3) |
So the smaller cluster is 100 percent female!
We will look into this after the next pair of variables for reasons that will become obvious.
3-Nearest Neighbor Distance Plot
# Select the variables
heart_cluster = heart %>% dplyr::select(chol, trestbps)
# Scale the variables
heart_cluster = heart_cluster %>% scale()
# Plot the 3-NN distances
kNNdistplot(heart_cluster, 3)
abline(h = .65, lty = 2)
Cluster Output from DBSCAN
# Perform DBSCAN
scan = dbscan(heart_cluster, eps = .65, minPts = 3)
# Extract the cluster labels
clusters = scan$cluster
# Merge labels back to the original data
data = data.frame(heart, clusters)
# Plot
data %>% ggplot(aes(y=chol, x=trestbps, color=as.factor(clusters))) + geom_point() +
scale_colour_manual(values=c("#D55E00","#0072B2","#E69F00"),
labels=c("Noise","Cluster 1", "Cluster 2"),
name = "Clusters") +
xlab("Resting Blood Pressure") + ylab("Serum Cholesterol (mg/dl)")
Summary Statistics and Fisher’s Exact Tests across Clusters
# Filter out noise points
data_filt = data %>% filter(clusters!=0)
# Make categorical variables into factors.
data_filt = data_filt %>% mutate(cp=as.factor(cp),
target=as.factor(target),
sex=as.factor(sex),
clusters=as.factor(clusters))
# Give better names
data_filt = data_filt %>% mutate(sex=ifelse(sex==0,"Female","Male"),
target=ifelse(target==0,"No Heart Disease","Heart Disease"),
clusters=ifelse(clusters==1,"Cluster 1","Cluster 2")) %>%
rename(`Chest Pain Level` = cp, Sex=sex, Target=target)
# Create a summary table stratified on the clusters.
table_descriptives <-
CreateTableOne(data = data_filt, vars = c("Sex","Chest Pain Level","Target"), strata = "clusters", test=TRUE, testExact = "fisher.test")
# Print table
kable(print(table_descriptives, missing = TRUE, showAllLevels = TRUE, print = FALSE,exact = c("Sex","Chest Pain Level","Target"))) %>% kable_styling(font_size = 9)
| level | Cluster 1 | Cluster 2 | p | test | Missing | |
|---|---|---|---|---|---|---|
| n | 297 | 4 | ||||
| Sex (%) | Female | 91 (30.6) | 4 (100.0) | 0.009 | exact | 0.0 |
| Male | 206 (69.4) | 0 ( 0.0) | ||||
| Chest Pain Level (%) | 0 | 139 (46.8) | 3 ( 75.0) | 0.872 | exact | 0.0 |
| 1 | 50 (16.8) | 0 ( 0.0) | ||||
| 2 | 85 (28.6) | 1 ( 25.0) | ||||
| 3 | 23 ( 7.7) | 0 ( 0.0) | ||||
| Target (%) | Heart Disease | 162 (54.5) | 2 ( 50.0) | 1.000 | exact | 0.0 |
| No Heart Disease | 135 (45.5) | 2 ( 50.0) |
As with the first pair, we see that the smaller cluster is 100 percent female as well.
Let’s look into this further by creating another summary table, but stratified on sex. We will then look at the original three continuous variables.
# Give better names
data_format = heart %>% rename(`Age (Years)`=age,
`Serum Cholesterol (mg/dl)`=chol,
`Resting Blood Pressure`=trestbps) %>%
mutate(sex=ifelse(sex==0,"Female","Male"),target=ifelse(target==0,"No Heart Disease","Heart Disease"))
# Create table
table_descriptives <-
CreateTableOne(data = data_format, vars = c("Age (Years)","Serum Cholesterol (mg/dl)","Resting Blood Pressure"), strata = "sex", test=TRUE)
# Print table
kable(print(table_descriptives, missing = TRUE, showAllLevels = TRUE, print = FALSE)) %>% kable_styling(font_size = 9)
| level | Female | Male | p | test | Missing | |
|---|---|---|---|---|---|---|
| n | 96 | 207 | ||||
| Age (Years) (mean (SD)) | 55.68 (9.41) | 53.76 (8.88) | 0.087 | 0.0 | ||
| Serum Cholesterol (mg/dl) (mean (SD)) | 261.30 (65.09) | 239.29 (42.78) | 0.001 | 0.0 | ||
| Resting Blood Pressure (mean (SD)) | 133.08 (19.31) | 130.95 (16.66) | 0.325 | 0.0 |
Thus, we can see that the smaller cluster is uncovering relationships that exist in our data set.
That is:
3-Nearest Neighbor Distance Plot
# Select the variables
heart_cluster = heart %>% dplyr::select(age, trestbps)
# Scale the variables
heart_cluster = heart_cluster %>% scale()
# Plot the 3-NN distances
kNNdistplot(heart_cluster, 3)
abline(h = .35, lty = 2)
Cluster Output from DBSCAN
# Perform DBSCAN
scan = dbscan(heart_cluster, eps = .35, minPts = 3)
# Extract the cluster labels
clusters = scan$cluster
# Merge labels back to the original data
data = data.frame(heart, clusters)
# Plot
data %>% ggplot(aes(y=trestbps, x=age, color=as.factor(clusters))) + geom_point() +
scale_colour_manual(values=c("#D55E00","#0072B2","#E69F00"),
labels=c("Noise", "Cluster 1","Cluster 2"),
name = "Clusters") +
ylab("Resting Blood Pressure") + xlab("Age (Years)")
This time we obtain a larger second cluster.
Summary Statistics and Fisher’s Exact Tests across Clusters
# Filter out noise points
data_filt = data %>% filter(clusters!=0)
# Make categorical variables into factors.
data_filt = data_filt %>% mutate(cp=as.factor(cp),
target=as.factor(target),
sex=as.factor(sex),
clusters=as.factor(clusters))
# Give better names
data_filt = data_filt %>% mutate(sex=ifelse(sex==0,"Female","Male"),
target=ifelse(target==0,"No Heart Disease","Heart Disease"),
clusters=ifelse(clusters==1,"Cluster 1","Cluster 2")) %>%
rename(`Chest Pain Level` = cp, Sex=sex, Target=target)
# Create a summary table stratified on the clusters.
table_descriptives <-
CreateTableOne(data = data_filt, vars = c("Sex","Chest Pain Level","Target"), strata = "clusters", test=TRUE, testExact = "fisher.test")
# Print table
kable(print(table_descriptives, missing = TRUE, showAllLevels = TRUE, print = FALSE,exact = c("Sex","Chest Pain Level","Target"))) %>% kable_styling(font_size = 9)
| level | Cluster 1 | Cluster 2 | p | test | Missing | |
|---|---|---|---|---|---|---|
| n | 281 | 9 | ||||
| Sex (%) | Female | 89 (31.7) | 3 (33.3) | 1.000 | exact | 0.0 |
| Male | 192 (68.3) | 6 (66.7) | ||||
| Chest Pain Level (%) | 0 | 129 (45.9) | 6 (66.7) | 0.761 | exact | 0.0 |
| 1 | 45 (16.0) | 1 (11.1) | ||||
| 2 | 84 (29.9) | 2 (22.2) | ||||
| 3 | 23 ( 8.2) | 0 ( 0.0) | ||||
| Target (%) | Heart Disease | 157 (55.9) | 2 (22.2) | 0.084 | exact | 0.0 |
| No Heart Disease | 124 (44.1) | 7 (77.8) |
We find that 78% of the subjects in this cluster do not have heart disease. Let’s investigate further.
# Create a summary table stratified on the presence of heart disease
table_descriptives <-
CreateTableOne(data = data_format, vars = c("Age (Years)","Serum Cholesterol (mg/dl)","Resting Blood Pressure"), strata = "target", test=TRUE)
# Print
kable(print(table_descriptives, missing = TRUE, showAllLevels = TRUE, print = FALSE)) %>% kable_styling(font_size = 9)
| level | Heart Disease | No Heart Disease | p | test | Missing | |
|---|---|---|---|---|---|---|
| n | 165 | 138 | ||||
| Age (Years) (mean (SD)) | 52.50 (9.55) | 56.60 (7.96) | <0.001 | 0.0 | ||
| Serum Cholesterol (mg/dl) (mean (SD)) | 242.23 (53.55) | 251.09 (49.45) | 0.139 | 0.0 | ||
| Resting Blood Pressure (mean (SD)) | 129.30 (16.17) | 134.40 (18.73) | 0.012 | 0.0 |
Thus, similar to the first two pairs, we also find that the smaller cluster reveals relationships that exist in the data:
Overall, we conclude that DBSCAN does a relatively good job of giving us clusters that have meaning, so long as we provide optimized parameters.
While DBSCAN is great because we don’t have to supply the number of clusters beforehand, if one is not careful in choosing appropriate values for \(\epsilon\) and minPts, one can end up with either many uninterpretable clusters or just one large cluster.
Secondly, if one uses Euclidean distance as the distance measure (which is most likely the case), then performing DBSCAN in higher dimensions will likely result in no clusters at all. This is a consequence of the curse of dimensionality: as the dimension grows, the Euclidean distances between points become increasingly similar, so dense and sparse regions can no longer be distinguished.
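This concentration effect is easy to demonstrate: for uniformly distributed points, the ratio of the farthest to the nearest neighbor distance shrinks toward 1 as the dimension grows. A quick base R simulation (sample sizes and dimensions are arbitrary choices of mine):

```r
# Mean ratio of farthest to nearest neighbor distance per point
set.seed(10)
concentration <- function(d, n = 500) {
  x  <- matrix(runif(n * d), ncol = d)   # n uniform points in d dimensions
  dd <- as.matrix(dist(x)); diag(dd) <- NA
  mean(apply(dd, 1, max, na.rm = TRUE) / apply(dd, 1, min, na.rm = TRUE))
}
ratios <- sapply(c(2, 10, 100), concentration)
ratios  # shrinks toward 1 as dimension grows
```

Once this ratio is close to 1, every point’s \(\epsilon\)-neighborhood contains either almost all points or almost none, and the density contrast that DBSCAN depends on disappears.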
Lastly, one of the biggest struggles for DBSCAN is detecting clusters that vary in density. This mostly comes down to the fact that no single sweet-spot value for \(\epsilon\) exists in that situation (there is no obvious bend in the k-NN distance plot).
One nice thing about DBSCAN is that it only assumes that the clusters we wish to form consist of a high density of observations.
This means that if your data points are spread out enough, DBSCAN might only return one cluster and some noise. This, however, may not necessarily be a bad thing, as it could indicate that no clusters should exist in your data in the first place.
We’ve discussed other clustering methods; however, there are also many extensions to the DBSCAN algorithm itself.
Ordering points to identify the clustering structure (OPTICS)
OPTICS is a density-based algorithm that aims to fix DBSCAN’s limitation of not being able to detect clusters of varying densities.
It does this by computing two special distance measures, called the core distance and the reachability distance. The points are then ordered and their reachability distances plotted in that order, producing a reachability plot. The narrower and deeper a valley in the reachability plot, the denser the cluster corresponding to that valley.
Figure 2: Example of the OPTICS workflow
Note that the orange-colored points and curves represent noise.
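A minimal OPTICS run using the same dbscan package as before. The synthetic data (one tight blob, one loose blob) and the eps_cl cut are my own illustrative choices:

```r
library(dbscan)

# One tight cluster and one loose cluster
set.seed(5)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.2), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.8), ncol = 2))

res <- optics(x, minPts = 5)  # no eps needed up front
plot(res)                     # reachability plot: one narrow valley, one wide valley

# Extract a flat DBSCAN-like clustering at a chosen reachability cut
clusters <- extractDBSCAN(res, eps_cl = 0.5)$cluster
table(clusters)
```

The reachability plot makes the density difference between the two blobs visible as valleys of different widths and depths, which a single DBSCAN \(\epsilon\) would struggle with.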
Hierarchical DBSCAN (HDBSCAN)
HDBSCAN uses a reachability-style distance similar to OPTICS, but also integrates concepts from hierarchical clustering.
The end goal is to produce a condensed cluster tree and use it to obtain the cluster labels.
Figure 3: Example of a condensed cluster tree
Note that clusters cannot be chosen as descendants of one another.
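Assuming the same dbscan package (which also ships an hdbscan() function), a minimal sketch on synthetic data with two clusters of different densities:

```r
library(dbscan)

# Two blobs of different densities
set.seed(9)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.2), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.8), ncol = 2))

hdb <- hdbscan(x, minPts = 5)  # only minPts is required; no eps at all
plot(hdb)                      # condensed cluster tree
table(hdb$cluster)             # 0 = noise
```

Because the flat clustering is extracted from the condensed tree by cluster stability, clusters of different densities can be recovered without any \(\epsilon\) parameter.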
When density-based clustering is used, DBSCAN is typically the go-to algorithm.
In cases where your data has clusters of varying densities, either OPTICS or HDBSCAN is a viable choice, especially since \(\epsilon\) no longer needs to be provided by the user.
In the case where you have a special kind of data, like images or geographical data, a more advanced and perhaps specialized algorithm may be needed.
Below is a list of some relevant applications related to the clinical field.
In general, having highly dense data and the appropriate variables to cluster on is usually enough for DBSCAN to be applicable.
There is one paper that I found interesting and would like to go over in more detail.
It involves using DBSCAN to cluster data relating to sound waves of subjects who are talking and swallowing, some of whom may have trouble swallowing.
The workflow is as follows:
Start by recording the 2-dimensional vibration data over some period of time. Data is collected once every millisecond.
For each axis, two variables are computed in 200ms intervals:
The choice of these variables is significant since other studies have shown that while no swallow is occurring, the values of these variables remain at some constant baseline, but rise substantially once a swallow is present.
The end result is a cluster signal where a high signal indicates a swallow, while a low signal indicates noise or no swallow. This cluster signal was projected back onto the waveform data and compared to cut offs set by a specialist.
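As a rough illustrative sketch only: the windowed feature below (a rolling standard deviation over 200 ms windows) is a stand-in of my own, not the paper’s actual variables, and in this toy version the swallow windows simply fall outside the dense baseline cluster (DBSCAN labels them noise) rather than forming a “high” cluster:

```r
# Simulated 1-D vibration signal: 5 s of baseline at 1 sample/ms,
# with one burst standing in for a swallow
set.seed(2)
fs <- 1000
signal <- rnorm(5 * fs, sd = 0.1)
signal[2000:2400] <- rnorm(401, sd = 1)

# One feature per non-overlapping 200 ms window (hypothetical choice)
win <- 200
starts <- seq(1, length(signal) - win + 1, by = win)
feat <- sapply(starts, function(i) sd(signal[i:(i + win - 1)]))

# Cluster the scaled windowed feature with DBSCAN
scan <- dbscan::dbscan(cbind(scale(feat)), eps = 0.5, minPts = 3)
scan$cluster  # baseline windows cluster together; burst windows are label 0
```

Mapping the window labels back onto the time axis gives the kind of cluster signal the paper describes, which can then be compared against a specialist’s cut-offs.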
Figure 4: Comparison of DBSCAN for healthy and unhealthy swallows
As a result of doing research on density-based clustering, I’ve thought of one interesting direction to move forward:
An interactive (Shiny) app that includes all the methods and algorithms in this report, plus more.
Many packages and frameworks have already been created to gather clustering algorithms together; this app, however, would take a different approach.
In particular, a user could upload a dataset and then interactively and dynamically select various algorithms and the parameters corresponding to them to quickly compare the outputs.
In cases like DBSCAN where parameters should be optimized, tips and visual guides can be provided to make this choice.
In cases like OPTICS and HDBSCAN, corresponding plots can be produced (reachability-plots and condensed cluster tree plots).
Finally, detailed documentation for each method and algorithm would be provided to help guide the user to pick the best approach for their data.
Overall this tool could be used by anyone (with or without programming experience) to effectively explore clustering with their data.
Overall, density based clustering is a useful unsupervised data mining technique with lots of potential, especially in the clinical field. We showed one example with a simple workflow that managed to uncover existing relationships in the data using DBSCAN. We also learned about the limitations of DBSCAN, and some alternatives that aim to fix these limitations. Finally we dove into some applications and how DBSCAN is currently being used in practice.
Tianyong Hao, Alexander Rusanov, Mary Regina Boland, Chunhua Weng, “Clustering clinical trials with similar eligibility criteria features”; Journal of Biomedical Informatics; Elsevier; December 2014.
F. Baselice, L. Coppolino, S. D’Antonio, G. Ferraioli and L. Sgaglione, “A DBSCAN based approach for jointly segment and classify brain MR images,” 2015 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Milan, 2015, pp. 2993-2996.
V. U. Panchami and N. Radhika, “A novel approach for predicting the length of hospital stay with DBSCAN and supervised classification algorithms,” The Fifth International Conference on the Applications of Digital Information and Web Technologies (ICADIWT 2014), Bangalore, 2014, pp. 207-212.
Joshua M. Dudik, Atsuko Kurosu, James L. Coyle, and Ervin Sejdić, “A Comparative Analysis of DBSCAN, K-Means, and Quadratic Variation Algorithms for Automatic Identification of Swallows from Swallowing Accelerometry Signals”; 2015 April 1; 59: 10–18.